
ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645

Closed
wants to merge 11 commits

Conversation

h-vetinari
Contributor

Closes #21357

I followed the recommendation of @jreback there to change duplicated, and to make the output a tuple if return_inverse=True.

I didn't find a test for the behavior of DataFrame.duplicated() -- there's a test of Index.duplicated() and Series.duplicated() in tests/test_base.py, but nothing I could find in tests/frame/.

So I added a test file covering the current behavior as well as the new kwarg. Notably, the two Series in the output tuple are computed by different backends (pandas._libs.hashtable.duplicated_int64 vs. np.unique), so I also added several tests that these results are exactly compatible.
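A sketch of the kind of cross-check those tests perform (hypothetical example, not the PR's actual test code): duplicated() marks everything except the first occurrence, so a mask derived from np.unique's return_index must match it exactly.

import numpy as np
import pandas as pd

ser = pd.Series(['a', 'b', 'a', 'c', 'b'])
result = ser.duplicated(keep='first').values        # hashtable-based path
# np.unique-based path: everything except first occurrences is a duplicate
_, first_occurrence = np.unique(ser.values, return_index=True)
expected = np.ones(len(ser), dtype=bool)
expected[first_occurrence] = False
np.testing.assert_array_equal(result, expected)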

@jschendel
Member

jschendel commented Jun 26, 2018

Can you add a whatsnew entry to 0.24.0?

Also, looks like we have asvs for DataFrame.duplicated. Might be nice to add methods for return_inverse=True to these. Not sure if this is strictly necessary though.

import numpy as np
from pandas import DataFrame, date_range


class Duplicated(object):

    goal_time = 0.2

    def setup(self):
        n = (1 << 20)
        t = date_range('2015-01-01', freq='S', periods=(n // 64))
        xs = np.random.randn(n // 64).round(2)
        self.df = DataFrame({'a': np.random.randint(-1 << 8, 1 << 8, n),
                             'b': np.random.choice(t, n),
                             'c': np.random.choice(xs, n)})
        self.df2 = DataFrame(np.random.randn(1000, 100).astype(str)).T

    def time_frame_duplicated(self):
        self.df.duplicated()

    def time_frame_duplicated_wide(self):
        self.df2.duplicated()

compatible with ``return_inverse``.
return_inverse : boolean, default False
    Determines whether the mapping from unique elements to the original
    index should be returned. If True, the output is a tuple.
Member

Can you add a .. versionadded:: 0.24.0 to this?

import pandas.util.testing as tm


class TestDataFrameDuplicated(object):
Member

@jschendel jschendel Jun 26, 2018

I think these tests should be in tests/frame/test_analytics.py. It looks like there are a couple DataFrame.duplicated tests related to bugs there, though nothing related to basic behavior, e.g.

@pytest.mark.slow
def test_duplicated_do_not_fail_on_wide_dataframes(self):
    # gh-21524
    # Given the wide dataframe with a lot of columns
    # with different (important!) values
    data = {'col_{0:02d}'.format(i): np.random.randint(0, 1000, 30000)
            for i in range(100)}
    df = pd.DataFrame(data).T
    result = df.duplicated()
    # Then duplicates produce the bool pd.Series as a result
    # and don't fail during calculation.
    # Actual values don't matter here, though usually
    # it's all False in this case
    assert isinstance(result, pd.Series)
    assert result.dtype == np.bool

The Series.duplicated tests are also in the associated tests/series/tests_analytics.py file, so makes sense to mirror that setup.


# keep = 'first'
exp = Series([False, False, True, False, True])
assert_series_equal(df.duplicated(keep='first'), exp)
Member

My preference would be to use a slightly more explicit pattern throughout these tests along the lines of:

expected = ...
result = ...
assert_series_equal(expected, result)

@jschendel jschendel added this to the 0.24.0 milestone Jun 26, 2018
@codecov

codecov bot commented Jun 27, 2018

Codecov Report

Merging #21645 into master will decrease coverage by 0.14%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #21645      +/-   ##
==========================================
- Coverage   92.18%   92.04%   -0.15%     
==========================================
  Files         169      169              
  Lines       50820    50818       -2     
==========================================
- Hits        46850    46776      -74     
- Misses       3970     4042      +72
Flag Coverage Δ
#multiple 90.45% <100%> (-0.15%) ⬇️
#single 42.19% <13.2%> (-0.19%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.41% <100%> (-0.04%) ⬇️
pandas/core/series.py 93.73% <100%> (-0.02%) ⬇️
pandas/core/base.py 96.99% <100%> (-0.62%) ⬇️
pandas/core/frame.py 97.21% <100%> (+0.01%) ⬆️
pandas/core/algorithms.py 94.8% <100%> (+0.11%) ⬆️
pandas/core/indexes/category.py 97.28% <100%> (+0.01%) ⬆️
pandas/core/indexes/base.py 96.45% <100%> (+0.05%) ⬆️
pandas/io/formats/console.py 65.15% <0%> (-10.61%) ⬇️
pandas/errors/__init__.py 92.3% <0%> (-7.7%) ⬇️
pandas/core/dtypes/base.py 92.68% <0%> (-7.32%) ⬇️
... and 55 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@WillAyd WillAyd left a comment

Thanks for the PR


class TestDataFrameDuplicated(object):

def test_duplicated_keep(self):
Member

Can you parametrize this test?

df = DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})

# keep = 'first'
exp = Series([False, False, True, False, True])
Member

As a pseudo-standard, it would be preferable to name this expected instead of exp.


# keep = 'first'
exp = Series([False, False, True, False, True])
assert_series_equal(df.duplicated(keep='first'), exp)
Member

Along the same lines, do result = df.duplicated(...) on a previous line, then simply assert_series_equal(result, expected).

assert_series_equal(df.duplicated(keep=False), exp)

def test_duplicated_nan_none(self):
# np.nan and None are considered equal
Member

Hmm is this expected?

Contributor Author

This is how it works currently - stuff like this becomes visible when writing tests. ;-)

Of course someone can open a separate issue on this.

Contributor Author

Opened #21720

assert_frame_equal(reconstr, df)

# keep = False
rgx = 'The parameters return_inverse=True and keep=False cannot be.*'
Member

Make this a separate test called test_duplicated_inverse_raises


Returns
-------
duplicated : Series
duplicated : Series or tuple of Series if return_inverse is True
Member

What's the design consideration in going (Series, Series) instead of, say, (Series, ndarray)? Wondering if the latter wouldn't be preferable, since AFAICT it would mostly be used for indexing operations.

Contributor Author

It's absolutely essential that it's a Series (otherwise it'd have to be two arrays at least), because - unlike in numpy - we also need to keep track of the original index.

So, the second Series here links two arrays - which original index gets mapped to which index in the deduplicated one. This is how the reconstruction can succeed:

df == unique.reindex(inv.values).set_index(inv.index)
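A minimal runnable version of that identity, emulating the proposed inverse with np.unique on a single-column frame (the PR itself factorizes rows into integer ids first; the data here is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 2, 0]}, index=[1, 4, 9, 16, 25])
isdup = df.duplicated()                # keep='first'
# o2u: positions of first occurrences; u2o: broadcast back to full length
_, o2u, u2o = np.unique(df['A'].values, return_index=True, return_inverse=True)
inv = pd.Series(df.index[o2u][u2o], index=df.index)
unique = df.loc[~isdup]                # same as df.drop_duplicates()
assert unique.reindex(inv.values).set_index(inv.index).equals(df)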

Contributor Author

I added examples to the docstrings that illustrate the usage (as well as the need to keep track of the index of the deduplicated object in case of Series/DataFrame).

Member

The index is still available in the first output, so it is not that essential that the second output is a Series (there is still a design issue for sure)

ids = ids[::-1]  # np.unique takes first occurrence as unique value
_, o2u, u2o = np.unique(ids, return_inverse=True,
                        return_index=True)
inv = Series(self.index[::-1][o2u][u2o][::-1], index=self.index)
Member

Can you express this in some other fashion or at the very least comment it? The double negative-step ([::-1]) logic is tough to understand, at least at first glance.
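For reference, the logic in question works roughly like this standalone sketch (made-up data; o2u/u2o as in the diff above). np.unique always reports the first occurrence, so the keep='last' branch reverses the array before the call and reverses every positional result back afterwards:

import numpy as np
import pandas as pd

ids = np.array([0, 1, 0, 2, 1])       # factorized row ids
index = pd.Index([1, 4, 9, 16, 25])   # original index labels

reversed_ids = ids[::-1]              # first occurrence here == last occurrence in ids
_, o2u, u2o = np.unique(reversed_ids, return_index=True, return_inverse=True)
# index[::-1][o2u]: labels of the last occurrences (still in reversed order)
# [u2o]:            broadcast back to full length (still reversed)
# [::-1]:           undo the reversal to restore the original row order
inv = pd.Series(index[::-1][o2u][u2o][::-1], index=index)
# inv now maps each label to the label of the *last* occurrence of its value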

@h-vetinari
Contributor Author

@jschendel @WillAyd
Thanks for the feedback, working to incorporate all the comments.

Another question that I realised while writing the whatsnew entry - should this kwarg be made available to Series/Index as well? In this case I'm guessing I'd have to adapt core.base.IndexOpsMixin.duplicated?

@h-vetinari
Contributor Author

@jschendel Re:

Also, looks like we have asvs for DataFrame.duplicated. Might be nice to add methods for return_inverse=True to these. Not sure if this is strictly necessary though.

Haven't ever worked with the asv's yet - will have a look later.

@h-vetinari
Contributor Author

h-vetinari commented Jun 27, 2018

I added three runs in the ASV; here are the results against current master. I don't know why frame_methods.Duplicated.time_frame_duplicated failed on master - that shouldn't be the case; all the others are explained by the missing return_inverse kwarg on master.

Edit: ran again on a clean, restarted machine with no background applications.

  before           after         ratio
[a620e725]       [094139bf]
       203ms            203ms     1.00  frame_methods.Duplicated.time_frame_duplicated
-     78.1ms           70.3ms     0.90  frame_methods.Duplicated.time_frame_duplicated_mixed
      failed            156ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed_inverse
      failed            125ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed_inverse_last
       203ms            203ms     1.00  frame_methods.Duplicated.time_frame_duplicated_wide
      failed            203ms      n/a  frame_methods.Duplicated.time_frame_duplicated_wide_inverse

@pep8speaks

pep8speaks commented Jun 27, 2018

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 31, 2018 at 08:00 Hours UTC

# will take fastpath for no duplicates
self.df2.duplicated(return_inverse=True)

def time_frame_duplicated_mixed(self):
Member

Instead of the 3 separate benchmarks below, you can parameterize over the return_inverse and keep keyword arguments.

Here's an example: https://github.com/h-vetinari/pandas/blob/094139bffa07c101882bbfbf35a1621eadc00adc/asv_bench/benchmarks/frame_methods.py#L219-L231
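For readers unfamiliar with asv, the parameterization pattern looks roughly like this (a sketch assuming the PR's return_inverse kwarg; params/param_names are asv's mechanism for running each benchmark once per combination):

import numpy as np
from pandas import DataFrame

class Duplicated(object):

    params = (['first', 'last', False], [True, False])
    param_names = ['keep', 'return_inverse']

    def setup(self, keep, return_inverse):
        if keep is False and return_inverse:
            # the PR raises for this combination; NotImplementedError
            # tells asv to skip it
            raise NotImplementedError
        n = 1 << 20
        self.df = DataFrame({'a': np.random.randint(-1 << 8, 1 << 8, n)})

    def time_frame_duplicated(self, keep, return_inverse):
        # return_inverse only exists on this PR's branch
        self.df.duplicated(keep=keep, return_inverse=return_inverse)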

Contributor Author

Thanks for the hint, coming in the commit for the next feedback round.

Contributor Author

Is it normal that the times are then not split up depending on the parameter?

I get:

[...]
[ 16.67%] ··· Running frame_methods.Duplicated.time_frame_duplicated         469ms;...
[ 33.33%] ··· Running frame_methods.Duplicated.time_frame_duplicated_mixed   125ms;...
[ 50.00%] ··· Running frame_methods.Duplicated.time_frame_duplicated_wide    203ms;...
[...]

Member

I believe the semicolons are splitting the time depending on the parameter. The format usually looks better with asv continuous.

Contributor Author

@mroeschke The output I posted is an extract from asv continuous, but I hadn't looked into asv compare yet, where the results were hiding. :)

@h-vetinari
Contributor Author

Test failure in circleci is unrelated. Failed in TestPythonParser.test_dtype_with_converter with

E           AssertionError: Caused unexpected warning(s): ['ResourceWarning'].

@h-vetinari
Contributor Author

@mroeschke
I got the results now, although I had to add a temporary if-condition (otherwise master would stumble over the unknown kwarg). Will remove that condition again, but I just wanted to have the same hash show up in the comparison as well as the commit I just pushed. Here are the results:

   before      after       ratio
 [a620e725]  [36b03a1c]
      203ms       203ms     1.00  frame_methods.Duplicated.time_frame_duplicated('first', False)
     failed       578ms      n/a  frame_methods.Duplicated.time_frame_duplicated('first', True)
+     203ms       250ms     1.23  frame_methods.Duplicated.time_frame_duplicated('last', False)
     failed       438ms      n/a  frame_methods.Duplicated.time_frame_duplicated('last', True)
     78.1ms      78.1ms     1.00  frame_methods.Duplicated.time_frame_duplicated_mixed('first', False)
     failed       125ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed('first', True)
     78.1ms      78.1ms     1.00  frame_methods.Duplicated.time_frame_duplicated_mixed('last', False)
     failed       125ms      n/a  frame_methods.Duplicated.time_frame_duplicated_mixed('last', True)
      219ms       203ms     0.93  frame_methods.Duplicated.time_frame_duplicated_wide('first', False)
     failed       203ms      n/a  frame_methods.Duplicated.time_frame_duplicated_wide('first', True)
      203ms       203ms     1.00  frame_methods.Duplicated.time_frame_duplicated_wide('last', False)
     failed       203ms      n/a  frame_methods.Duplicated.time_frame_duplicated_wide('last', True)

To me, the benchmarks seem a bit unstable; otherwise I don't know where the performance differences come from.

Contributor

@jreback jreback left a comment

still have yet to review the actual code

@@ -1,7 +1,7 @@
.. _whatsnew_0231:

v0.23.1
-------
v0.23.1 (June 12, 2018)
Contributor

don't modify things, separate PR if you would like to change this

``DataFrame.duplicated`` has gained the ``return_inverse`` kwarg
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, there was no way to determine how duplicate rows in a ``DataFrame`` got mapped to the deduplicated, unique subset. This made it hard to push back
Contributor

too much explanation. make this a sentence or (all of the text). you don't need to explain why

index=[1, 4, 9, 16, 25])
df
isdup, inv = df.duplicated(return_inverse=True) # default: keep='first'
isdup
Contributor

spell out names


.. ipython:: python

    unique = df.loc[~isdup]  # same as df.drop_duplicates()
Contributor

this example is confusing because you are using different code here

Contributor Author

Not sure I understand. It refers to the same objects as in the code block above - it's just a continuation.

@h-vetinari
Contributor Author

h-vetinari commented Jun 29, 2018

As I already mentioned above, it became obvious to me while writing the whatsnew entry that this kwarg should be enabled for Series/Index as well. The last commit contains a working version but no tests yet -- ran into trouble with #21684. Additional xref #21685

This is already much more elegant than the previous commits, because it does all the work for return_inverse in one spot (pandas.core.algorithms), and all the calling objects just need to wrap around that.

In summary, the following is still lacking:

  • tests for Series/Index/MultiIndex
  • asv for Series/Index/MultiIndex
  • update whatsnew

@h-vetinari
Contributor Author

I finally finished adding all the tests; turns out it's quite complicated to do anything for Index, haha. ;-)

The core algorithm changes almost vanish compared to the amount of test changes. Some of the things I added tests for were not really tested so far (bare duplicated was only tested implicitly - and then only sometimes - for stuff like drop_duplicates). I also split up pandas/tests/indexes/multi/test_unique_and_duplicates.test_duplicated, which had a large number of tests in one.

ASVs should be in line with what I posted above, because as long as return_inverse=False, I never calculate anything additional.

I needed to implement a different signature for pandas.core.algorithms.duplicated, which is inspired by np.unique, in the sense that it has both return_index and return_inverse (with the same meanings). This method is not public as far as I can tell? In any case, the docstring reflects the changes.
Finally, I added a stabilize-kwarg, which stabilizes against the implicit sorting that numpy does. It's on by default, but turning it off is a performance improvement in case the index/inverse are immediately plugged together again (as is the case for Series/DataFrame).

Tried to explain all the code as well as possible, but happy to receive feedback (and/or answer questions).

- False : Mark all duplicates as ``True``.
- False : Mark all duplicates as ``True``. This option is not
compatible with ``return_index`` or ``return_inverse``.
return_index : boolean, default False
Contributor

pls pls let's keep the scope to the original. just return_inverse. If you want to do more things, make them PR's on top.

@h-vetinari
Contributor Author

pls pls let's keep the scope to the original. just return_inverse. If you want to do more things, make them PR's on top.

@jreback I did keep strictly with the scope of just adding return_inverse to duplicated. The issue is that Series/DataFrame need a slightly different treatment than Index:

  • the inverse for Series/DataFrame needs to take into account two indexes (the original index, and the one of the deduplicated data). This needs to combine the outputs return_index=True and return_inverse=True from np.unique.
  • the inverse for Index is just a simple np.ndarray, which would in principle just need return_inverse=True from np.unique, except we need to undo the implicit sorting that np.unique does.

There are two options from my POV:

  1. implement two different duplicated-methods for Series/DataFrame vs. Index, which would however overlap if return_inverse=False.
  2. (current state of PR) implement one central duplicated-method, but with enough flexibility to cover both cases, and have thin wrappers around that. Since -- AFAIU -- pandas.core.algorithms.duplicated is not public API, I decided to add that extra flexibility.

Does that make the situation clearer? I don't mind implementing option 1, but it would lead to code duplication and would IMHO be messier.

@h-vetinari
Contributor Author

Ugh, resolving conflicts on windows is such a pain (conflict resolution commits not recognised by git). Had to do a massive rebase...

Also added some examples to the separate duplicated-docstrings.

@h-vetinari
Contributor Author

@WillAyd I incorporated your feedback - any open questions?

@jreback care to take a look?

Member

@jorisvandenbossche jorisvandenbossche left a comment

Didn't look in detail to the implementation, but some general questions:

  • What is the reason you decided to add it to duplicated and not drop_duplicates? (because from quickly looking at the issue, and from the docstring examples, that is actually what you want?)
    I also ask because I think it would be a more natural fit there, as it is not literally an "inverse" for the result of duplicated (as it is for drop_duplicates or as for unique in numpy)

  • I think we should have more discussion (or at least motivation for the current design) about the actual return value (series vs array, index labels vs positional integers). Always returning an array of integers would also be an option (and is eg robust for duplicates in the index)

@@ -159,6 +159,52 @@ This is the same behavior as ``Series.values`` for categorical data. See
:ref:`whatsnew_0240.api_breaking.interval_values` for more.


.. _whatsnew_0240.enhancements.duplicated_inverse:

The `duplicated`-method has gained the `return_inverse` kwarg
Member

double backticks where you now use single backticks (also on the lines below)

(in rst, double backticks give code-styled text)

- False : Mark all duplicates as ``True``. This option is not
compatible with ``return_inverse``.
return_inverse : boolean, default False
If True, also return the selection from the index from the
Member

"selection from the index" sounds strange to me, and it does not really say what the values are (I understand what you mean, but I think we should think about a better wording).
Also the following lines are rather complex to understand (eg the "boolean complement of the first output")



@h-vetinari
Contributor Author

@jorisvandenbossche

What is the reason you decided to add it to duplicated and not drop_duplicates ? (because from quickly looking at the issue, and from the docstring examples, that is actually what you want?)
I also ask because I think it would be a more natural fit there, as it is not literally an "inverse" for the result of duplicated (as it is for drop_duplicates or as for unique in numpy)

I wanted to add .unique method in #21357, but got told by @jreback that:

-1 on adding any new methods, but would be ok with adding a return_inverse kwarg (clearly defaulting to False), to .duplicated() to easily recovering the indexer.

I still think it makes sense for duplicated, which is invoked by drop_duplicates, and so would need to support that anyway if one wanted to add it to drop_duplicates.
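Roughly, and ignoring details like the subset argument, the relationship is (simplified sketch, not the actual pandas source):

def drop_duplicates(df, keep='first'):
    # drop_duplicates is duplicated() plus a boolean mask, so any
    # inverse support there needs the kwarg on duplicated() anyway
    return df.loc[~df.duplicated(keep=keep)]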

I think we should have more discussion (or at least motivation for the current design) about the actual return value (series vs array, index labels vs positional integers). Always returning an array of integers would also be an option (and is eg robust for duplicates in the index)

numpy returns two arrays of indexes, one relative to the original set and one relative to the unique values. I think this information fundamentally belongs together and since pandas supports the index/values distinction, I strongly believe that they should live in the same object. I don't like reusing the index of duplicated itself, because then I need to keep around two separate objects to reconstruct. But happy to discuss alternative approaches.

@h-vetinari
Contributor Author

@jorisvandenbossche
I incorporated your feedback. Since no compelling alternatives have come up for the return type, I left that aspect as-is.

@h-vetinari
Contributor Author

@jreback @WillAyd @jorisvandenbossche

How can I/we move this PR forward?

@jorisvandenbossche
Member

@h-vetinari sorry for the slow answer

I think my main worry is that we are adding a return_inverse keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is similar in name to numpy's keyword but different in usage.

Numpy's unique is more comparable to drop_duplicates. Therefore I think such a keyword would be a more natural fit there.

numpy returns two arrays of indexes

Can you clarify this? You are talking about np.unique? (but I think that only returns a single array of indices?)

Side note: it might make sense to add this to pd.unique / Series.unique as well? (not necessarily at the same time; or might actually be an easier starter)

@h-vetinari
Contributor Author

h-vetinari commented Sep 18, 2018

@jorisvandenbossche

I think my main worry is that we are adding a return_inverse keyword which actually does not return the inverse for that function (it does return the inverse for another function), and that it is in name similar to numpy's keyword, but in usage also different.

Fair point. I would prefer a df.unique as well, but was following what I was told in #21357

-1 on adding any new methods, but would be ok with adding a return_inverse kwarg (clearly defaulting to False), to .duplicated() to easily recovering the indexer.

which is how I implemented it here. Could still be added to drop_duplicates as well, but since that depends on duplicated, we'd need it there anyway.

Re:

numpy returns two arrays of indexes

Can you clarify this? You are talking about np.unique? (but I think that only returns a single array of indices?)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html has both return_index and return_inverse, and both are necessary to reconstruct from the pandas side (the difference for us being that we also need to keep track of the index, whereas numpy just has arange as index).

Side note: it might make sense to add this to pd.unique / Series.unique as well? (not necessarily at the same time; or might actually be an easier starter)

Agreed. Actually, for me the best solution would be:

  • add df.unique (could be a thin wrapper around drop_duplicates)
  • have df.unique, series.unique, index.unique with return_inverse-kwarg
  • dispatch from pd.unique to appropriate code branch

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i haven't looked at this in any detail, there is a lot going on here.


if keep == 'first':
    # o2u: original indices to indices of ARRAY of unique values
    # u2o: reduplication from array of unique values to original array
Contributor

I'd rather you not use np.unique at all; it's not as performant as pd.unique, doesn't handle all dtypes, and sorts.

Further pls, pls use actual names here, and really avoid using abbreviations in any library code.

Contributor Author

@h-vetinari h-vetinari Sep 23, 2018

@jreback

I'd rather you not use np.unique at all, its not as performant as pd.unique, doesn't handle all dtypes, and sorts.

Yes, it would be nicer to have this implemented in the cython hashtable functions, but that performance improvement is for a follow-up. np.unique is an easy solution and is invoked only for return_inverse=True (and we're only calling it on a series of ints, not objects, because the factorization for that hasn't changed!).

Further pls, pls use actual names here, and really avoid using abbreviations in any library code.

There's a fully commented, thoroughly explained and very localized part where these appear. Not sure how this is unclear, but will adapt...

Contributor Author

Finally, the core changes are minimal (but I understand that it looks like a lot). TLDR: Implementation moves to core.algorithms, everything else wraps around that (+ doc improvements). Tests are expanded, and ASVs added.

@jorisvandenbossche
Member

Fair point. I would prefer a df.unique as well, but was following what I was told in #21357

I have wanted a DataFrame.unique for a long time, but the question is how it would be different compared to drop_duplicates?

Anyway, yes @jreback put you on the path to add it to duplicated as a first idea, but I am questioning that choice. So regardless of previous recommendation, we should still discuss that (@jreback can you chime in here?)

numpy.unique has both return_index and return_inverse, and both are necessary to reconstruct from the pandas side

I think numpy's return_index is rather unrelated here? It is about the index into the original array to get the unique ones, while you want to preserve the index of the original array?

Agreed. Actually, for me the best solution would be:

add df.unique (could be a thin wrapper around drop_duplicates)
have df.unique, series.unique, index.unique with return_inverse-kwarg
dispatch from pd.unique to appropriate code branch

Adding a return_inverse to the existing unique function / methods would certainly be a welcome change I think.

@jreback
Contributor

jreback commented Sep 24, 2018

i am not against adding this to .unique, though as you have mentioned that is indeed just drop_duplicates

@h-vetinari
Contributor Author

h-vetinari commented Sep 24, 2018

@jorisvandenbossche
Thanks for your answer:

I have wanted a DataFrame.unique for a long time, but the question is how it would be different compared to drop_duplicates?

I would suggest that unique just acts as a thin wrapper around drop_duplicates. The latter has been in the pandas API forever, but the former name is even more established, and I think they could peacefully coexist.

I think the numpy's return_indices is rather unrelated here? It is about the index into the original array to get the unique ones, while you want to preserve the index of the original array?

Assuming np.unique didn't sort (which I have to laboriously undo in this PR) and keep='first', we would have:

isduplicated = some_series.duplicated()
_, values2unique, unique2values = np.unique(some_series.values,
                                            return_index=True, return_inverse=True)
inverse = Series(some_series.index[values2unique][unique2values], index=some_series.index)

and neither values2unique (the indices) nor unique2values (the inverse) can be omitted. Again, the difference to numpy here is that some_series can have an index different from arange(len(some_series)), which is why we need to first select the right indices of the index (to get the uniques) and then broadcast them back to the full shape to get the full inverse. If we didn't care about the index of some_series, then the indices would indeed not be necessary, but then we wouldn't be able to use the inverse in .loc or .reindex of the uniques (=some_series.loc[~isduplicated]), because the indexes wouldn't match.
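For concreteness, the snippet above run on made-up data, including the round trip that matching indexes enable:

import numpy as np
import pandas as pd

some_series = pd.Series(['b', 'a', 'b', 'c', 'a'], index=[1, 4, 9, 16, 25])
isduplicated = some_series.duplicated()            # [F, F, T, F, T]
_, values2unique, unique2values = np.unique(some_series.values,
                                            return_index=True,
                                            return_inverse=True)
inverse = pd.Series(some_series.index[values2unique][unique2values],
                    index=some_series.index)       # values: [1, 4, 1, 16, 4]
uniques = some_series.loc[~isduplicated]           # labels [1, 4, 16]
# broadcast the uniques back out via .loc, then restore the original index
reconstructed = pd.Series(uniques.loc[inverse.values].values,
                          index=inverse.index)
assert reconstructed.equals(some_series)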

Adding a return_inverse to the existing unique function / methods would certainly be a welcome change I think.

I had already collected some thoughts for this during the day and opened a (largely orthogonal to this PR) issue for this: #22824

EDIT: clarified mechanics for inverse

@jorisvandenbossche
Member

Assuming np.unique didn't sort (which I have to laboriously undo in this PR)

That is because for the inverse you rely on np.unique instead of our own implementations?
Another option might be to actually rely on our own algos for that? (I assume the hashtables already support that, as eg we need it for factorizing. Which led me to think: pd.unique with a return_inverse argument is actually basically the same as pd.factorize?)
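Indeed, pd.factorize already returns that pair for the positional (keep='first', no index) case - a quick illustration:

import numpy as np
import pandas as pd

values = np.array(['a', 'b', 'a', 'c', 'b'], dtype=object)
codes, uniques = pd.factorize(values)
# codes   -> array([0, 1, 0, 2, 1]): a positional inverse for keep='first'
# uniques -> array(['a', 'b', 'c']): in order of first appearance, unsorted
assert (uniques[codes] == values).all()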

I see how you use both return_index and return_inverse there, but that is to construct the inverse (which would be hidden from the user); I thought you were speaking about the actual API for return_inverse towards the user. Hence the confusion.

@h-vetinari
Contributor Author

@jorisvandenbossche

That is because for the inverse you rely on np.unique instead of our own implementations?
Another option might be to actually rely on our own algos for that?
(I assume the hashtables already support that, as eg we need it for factorizing. Which led me think: pd.unique with a return_inverse argument is actually basically the same as pd.factorize ?)

When I started this, I just saw factorize being used to create the IDs which were then fed again into a hashtable. I looked into the hashtables code, but the duplicated_{dtype} methods had no return_inverse and I didn't want to start right away by messing in the templated cython. So I started with np.unique, which is not great (due to having to undo the sorting), but is otherwise easy to start with. The adaptation of the cython code was/would have been a follow-up for me.

Looking at the implementation of factorize, you are right, there is a get_labels of the corresponding table. This is still not immediately usable though, because (AFAICT), it has no way to distinguish between keep='first'|'last' (which is relevant even if there's no index!); and because of the different na-handling of factorize that actually leads to subtle bugs in duplicated, see #21720.

I could work around the former (like I do here for np.unique), but the latter would break the behaviour of Series.unique for mixed NaNs in object columns. And it doesn't seem as easy as setting a different na_value in factorize, as the docstring warns:

na_value : object, optional
        A value in `values` to consider missing. Note: only use this
        parameter when you know that you don't have any values pandas would
        consider missing in the array (NaN for float data, iNaT for
        datetimes, etc.).

Maybe need to add a way for factorize to deal with different NaNs first?

I see how you use both return_index and return_inverse there, but that is to construct the inverse (so which would be hidden for the user), I thought you were speaking about the actual API for return_inverse towards the user. So therefore the confusion.

Just to make sure we're on the same page: I need the return_index of np.unique internally, but do not suggest to expose that in the pandas API. I only used it as an argument that we have an extra layer of information (the index) to keep track of, and why the output should be a Series (containing, in some transformed way, both the return_index and return_inverse from np.unique).

@h-vetinari
Contributor Author

Alright, I'm closing this for now, as the lack of cython-inverse has now been solved by #23400, but for .unique rather than .duplicated. Therefore I'll follow that line of development (it's also closer to the intention of #21357).
